Understanding Summarization in VIDIZMO
Summaries provide a concise overview of your content, allowing viewers to quickly understand the key information without wading through unnecessary details. Using the VIDIZMO Indexer app, you can create summaries for your audio, video, and document content using various summarization approaches. You can generate summaries based on visuals, transcribed text from audio, or a combination of both. The application also enables you to summarize documents using selectable text.
Each summarization method can be configured directly in the VIDIZMO Indexer and is organized into separate classes. These classes can be added as AI insights for processing and have their own individual consumption category, where their usage can be tracked independently.
The VIDIZMO Indexer also offers additional configuration options to determine how summaries are generated. For instance, you can specify forbidden words to ensure they are not included in the summaries, or adjust settings that control how many frames are processed in visual summarization, helping to balance processing speed and accuracy. For detailed instructions on how to configure the VIDIZMO Indexer for summarization, refer to Configuring VIDIZMO Indexer for Summarization.
System Requirements for AI Summarization Service
To ensure optimal performance of the VIDIZMO AI Summarization service, the following hardware specifications are recommended:
Minimum Requirements
- GPU: NVIDIA GPU with 16 GB VRAM
- CPU: Minimum 6 cores
- Memory: 32 GB RAM
- Storage: At least 300 GB (NVMe SSD Recommended)
Recommended Requirements
- GPU: NVIDIA GPU with 20+ GB VRAM
- CPU: CPU with 8+ cores
- Memory: 48 GB RAM
- Storage: At least 300 GB (NVMe SSD Recommended)
Note: These settings accommodate higher batch sizes (increased VRAM and RAM), faster video decoding (better CPU), and faster video analysis (faster GPU).
Concept
VIDIZMO summarization involves extracting the essential ideas from your content and creating a coherent and comprehensive summary as the result. The process for summarization varies depending on the content you are working with and the summarization class you've chosen. Below are the classes you can configure in the VIDIZMO Indexer, each designed to generate summaries for specific content types.
Document Summarization
Document summarization generates summaries for your documents by analyzing their text. This approach only works with documents that contain selectable text. To generate summaries, the VIDIZMO Indexer processes the input text to grasp the core ideas or key points, and then reflects them concisely in the output summary.
The generated summary may have a different sentence structure or vocabulary than the original text. However, the application ensures that the core concept or information is preserved and represented clearly and coherently.
Audio Summarization
Audio summarization analyzes the spoken words in your content to create a summary. This process works with audio and video files containing an audio stream with spoken words. The VIDIZMO Indexer first transcribes the audio stream into text, generating transcriptions of the content. The summarization model then analyzes the transcription text, extracts the key points, and produces a coherent, concise summary.
This feature supports various audio and video file formats, including MP3, WAV, MP4, and more. For best results, the audio should have minimal background noise.
A common use case for audio summarization is generating meeting minutes automatically from recorded sessions, enabling participants to quickly catch up on important discussions without listening to the entire audio.
You can also use audio summarization on content transcribed by other VIDIZMO indexing applications, such as the Azure ARM or AWS Indexer.
Supported Languages for Audio and Document Summarization
For Audio and Document Summarization, the VIDIZMO Indexer generates summaries in the same language as the original text. In cases where the content is multilingual (contains multiple languages), the VIDIZMO Indexer generates the summary in the language it initially detects when analyzing the text obtained from the transcriptions or document. It supports multiple languages, including prominent ones like:
- English
- German
- French
- Italian
- Portuguese
- Hindi
- Spanish
- Thai
- Arabic
Video Summarization Process
Video summarization analyzes visual content to generate summaries. The information is derived from frames extracted from video files, which are then analyzed by the VIDIZMO Indexer's AI model. During the preprocessing stage, the frames are gathered into batches or segments, and a summary is generated once all the segments (containing the frames) have been analyzed.
The summary produced by the model depends on the number of frames collected during preprocessing. The VIDIZMO Indexer provides options to control how many frames are gathered for processing, allowing you to adjust factors such as processing speed and summary accuracy.
In the VIDIZMO Indexer configurations, when you select either Video Summarization or Video and Audio Summarization, you can access settings that control how frames are gathered and processed. These options enable you to optimize the summarization process for faster processing or more accurate results.
As of now, video summarization outputs its summary exclusively in English, regardless of the video’s original spoken language. This means if you upload a video with Spanish audio, the summary generated by video summarization alone will still be in English.
Video Sampling Rate
As mentioned above, frames are extracted from videos, queued, and loaded for analysis. The video summarization process then generates summaries based on the frames obtained. The video sampling rate determines how many frames are extracted per second for analysis. You can set the sampling rate between 1 and 8 to control how many frames are loaded each second for processing.
Adjusting the sampling rate directly affects the balance between summary quality and processing efficiency. Increasing the sampling rate enhances the summary’s detail and accuracy by considering more visual information. Decreasing the sampling rate speeds up processing but may lead to a less detailed summary if important visual changes are missed.
- A high sampling rate is ideal for fast-paced videos where scenes change frequently. It allows the indexer to extract and process more frames per second, ensuring the summary captures the full context and detail of the video.
- A low sampling rate works best for videos with slower scene changes, such as town hall or conference room meetings, where the visuals remain relatively consistent. By using a lower sampling rate, you not only improve the accuracy and relevance of the summary but also speed up processing, as fewer frames need to be analyzed.
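The arithmetic behind the sampling rate can be sketched as follows. This is an illustrative approximation, not the VIDIZMO implementation; the function name and the uniform frames-per-second assumption are ours.

```python
def frames_extracted(duration_seconds: float, sampling_rate: int) -> int:
    """Approximate number of frames queued for analysis at a given
    sampling rate (valid range: 1 to 8 frames per second)."""
    if not 1 <= sampling_rate <= 8:
        raise ValueError("sampling rate must be between 1 and 8")
    return int(duration_seconds * sampling_rate)

# A 10-minute (600-second) video:
print(frames_extracted(600, 1))  # 600 frames at the lowest rate
print(frames_extracted(600, 8))  # 4800 frames at the highest rate
```

The eightfold difference in frame count is why the sampling rate has such a direct effect on processing time.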
Frame Similarity Threshold
The frame similarity threshold helps determine the quality of the summary generated for your video by controlling how similar frames are handled. This setting allows you to decide which frames to discard during preprocessing. By removing extra frames that are too similar, you can speed up the processing and improve the accuracy of the summary, as the model doesn't need to analyze redundant frames. You can choose from the following options for the frame similarity threshold.
- High: Frames that are highly similar to previously processed frames are discarded. This option may speed up summary generation and improve accuracy, as fewer redundant frames are processed.
- Medium: Frames that are moderately to highly similar to previously processed frames are discarded. This option further reduces the number of redundant frames, enhancing processing speed while retaining sufficient context for an accurate summary.
- Low: Frames with low, moderate, and high similarity to previously processed frames are discarded. This significantly reduces the number of redundant frames obtained during preprocessing. While this may significantly speed up processing, it could affect the summary's accuracy since fewer frames are used for analysis.
- Disabled: Prevents any frames from being discarded during preprocessing. The summarization model analyzes every frame, ensuring that even subtle visual information is fully captured. This is ideal for short videos, content with minimal scene changes, or scenarios requiring exhaustive visual analysis. This approach may provide the most detailed summaries, but processing time and resource usage can increase significantly, especially for longer videos or videos with high frame rates.
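The threshold options above can be pictured with a toy filter. This is a sketch only: the similarity metric, the numeric cut-off values, and the function are assumptions for illustration, not VIDIZMO's actual algorithm.

```python
# Assumed numeric cut-offs for each threshold setting (illustrative only).
THRESHOLDS = {"High": 0.95, "Medium": 0.80, "Low": 0.60}

def filter_frames(similarities: list[float], setting: str) -> list[int]:
    """Return the indices of frames kept for analysis.
    similarities[i] is frame i's similarity (0..1) to the previous frame.
    A frame is discarded when its similarity meets or exceeds the cut-off;
    'Disabled' keeps every frame."""
    if setting == "Disabled":
        return list(range(len(similarities)))
    cutoff = THRESHOLDS[setting]
    return [i for i, sim in enumerate(similarities) if sim < cutoff]

# Four frames: near-duplicate, fairly different, quite similar, very different.
sims = [0.99, 0.70, 0.90, 0.50]
print(len(filter_frames(sims, "High")))      # 3 frames kept
print(len(filter_frames(sims, "Medium")))    # 2 frames kept
print(len(filter_frames(sims, "Low")))       # 1 frame kept
print(len(filter_frames(sims, "Disabled")))  # all 4 frames kept
```

The pattern matches the descriptions above: stricter (lower) settings discard progressively more frames, trading summary detail for speed.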
Impact of Video Summarization Settings
Video summarization quality and processing efficiency can be affected by the sampling rate and similarity threshold.
In the video summarization process, frames are first extracted from the video based on the configured sampling rate, which determines how many frames are processed per second. After the frames are sampled, the frame similarity threshold is applied to filter out redundant frames that are too similar to previously processed ones. This two-step approach—sampling frames followed by similarity filtering—helps balance processing speed and summary detail by controlling both the quantity and uniqueness of frames analyzed.
Understanding how these parameters interact can help you optimize summarization for your specific video content.
For example, let's try summarizing a 10-minute conference video where slides change occasionally and a speaker moves on stage.
- Setting a high sampling rate (e.g., 8 frames per second) means the system extracts many frames capturing every small movement or slide change.
- If you also set the frame similarity threshold to High, the system removes frames that look nearly identical, reducing the number of frames analyzed. This speeds up processing while still capturing the key visual changes like slide transitions or speaker gestures.
- Conversely, a low sampling rate (e.g., 1 frame per second) combined with a Medium/Low similarity threshold means very few frames are analyzed, speeding up the process but potentially missing some finer visual details, such as subtle gestures or brief slide transitions.
- If the similarity threshold is Disabled, the system analyzes all extracted frames, even if many look alike. For a 10-minute conference video, this means every frame captured at the sampling rate (e.g., 1 frame per second) is processed, ensuring that even minimal visual changes like slight head movements or small slide adjustments are included, though this will increase processing time and resource usage.
By adjusting these two settings together, you can find the right balance between summary detail and processing efficiency depending on your video content type and resource availability.
Configuration Examples and Expected Outcomes
The following table shows example configurations and the use cases where each is ideal.
Sampling Rate (frames/sec) | Frame Similarity Threshold | Processing Speed | Summary Detail | Ideal Use Case & Recommendations |
---|---|---|---|---|
1 | Low/Medium | Fast | Low detail | Static videos, e.g., conference meetings with minimal camera movement, where you are not concerned about capturing minor movements. Use for slow scene changes to save resources. |
1 | High | Moderate | Balanced detail | Conference meetings with minimal camera movement, where you want to capture minor movements such as peoples' movements. Use for slow scene changes to save resources. |
1 | Low | Fast | Low detail | Very long static videos like lecture recordings where you only want to record major slide changes. Maximize processing speed by discarding most similar frames. |
1 | Medium | Moderate | Balanced detail | Mixed-content videos like webinars or training sessions with some scene changes. Balance detail and speed effectively. |
4 | Disabled/Low | Slow | Highest detail | Sports clips or bodycam footage involving very fast-moving events where capturing every movement matters. Use for short videos that require exhaustive detail capture. |
Audio and Video Summarization
Audio and Video Summarization in VIDIZMO combines video and audio summarization techniques to create a more comprehensive and accurate summary. This method processes visuals (similar to video summarization) and text derived from audio transcriptions (similar to audio summarization). The summarization model delivers a richer, more contextually relevant summary by integrating visual and auditory data.
When selecting this summarization method, you will still have the ability to configure video summarization settings, such as the video sampling rate and frame similarity threshold. Transcriptions are also generated if your content contains audio with spoken words.
Language Output
For Audio and Video Summarization, even though the model analyzes the audio transcription in its original language (e.g., Spanish), the final output summary is generated in English. This means if you process a video with Spanish audio, the summarization model used by the VIDIZMO Indexer processes both the Spanish audio transcription and the visual frames but produces the summary text exclusively in English.
The transcription output generated by Audio and Video Summarization will always correspond to the base language of the audio that's analyzed. For instance, in the example of a video with Spanish audio, the transcription will be in Spanish, while the summary text will be in English.
Processing Time and Sizing
Understanding processing times and resource requirements based on video length, quality, and summarization type is crucial for effective resource planning and operational efficiency.
The following table provides processing time estimates for various video lengths and qualities, covering both Video summarization and Video and Audio summarization.
Video Summarization Processing Times
Video Length | Quality | Processing Time | Video Analysis | Summarization |
---|---|---|---|---|
1 hour | 360p | 2 min 12 sec | 1 min 55 sec | 17 sec |
2 hours | 480p | 1 min 16 sec | 1 min 13 sec | 3 sec |
3 hours | 360p | 6 min 02 sec | 4 min 56 sec | 1 min 6 sec |
4 hours | 360p | 2 min 37 sec | 2 min 27 sec | 10 sec |
5 hours | 360p | 3 min 24 sec | 3 min 17 sec | 7 sec |
Video and Audio Summarization Processing Times
Video Length | Quality | Processing Time | Transcription | Video Analysis | Summarization |
---|---|---|---|---|---|
1 hour | 360p | 8 min 48 sec | 8 min 20 sec | 2 min 4 sec | 28 sec |
2 hours | 480p | 25 min 7 sec | 24 min 43 sec | 1 min 14 sec | 24 sec |
3 hours | 360p | 36 min 39 sec | 34 min 25 sec | 4 min 41 sec | 2 min 14 sec |
4 hours | 360p | 52 min 1 sec | 51 min 34 sec | 2 min 25 sec | 27 sec |
5 hours | 360p | 1 hr 3 min 29 sec | 1 hr 2 min 9 sec | 3 min 15 sec | 1 min 20 sec |
Note: Transcription and Video Analysis run in parallel.
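Because transcription and video analysis run in parallel, the end-to-end times in the combined table decompose as roughly the longer of the two parallel stages plus the summarization step. A quick check against the 1-hour row (this decomposition is our reading of the tables, not an official formula):

```python
def total_seconds(transcription: int, video_analysis: int, summarization: int) -> int:
    """End-to-end time when transcription and video analysis run in
    parallel, followed by summarization (all values in seconds)."""
    return max(transcription, video_analysis) + summarization

# 1-hour, 360p row: 8 min 20 s transcription, 2 min 4 s analysis, 28 s summary
t = total_seconds(8 * 60 + 20, 2 * 60 + 4, 28)
print(f"{t // 60} min {t % 60} sec")  # 8 min 48 sec, matching the table
```

The same decomposition holds for the other rows, e.g., the 2-hour row: max(24 min 43 s, 1 min 14 s) + 24 s = 25 min 7 s.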
Key Points
- Video only summarization is significantly faster than combined Video and Audio summarization because it processes visual frames without audio transcription.
- Video and Audio summarization processing time is largely influenced by transcription, which takes the majority of the time, especially for longer videos. Choose Video-only summarization for faster processing when visual context suffices or audio content is less important.
- Processing times depend on video resolution and length, but other factors such as content complexity and scene changes can also affect performance.
- Video quality differences (360p vs 480p) influence frame extraction complexity and may impact processing speed.
Forbidden Words
The Forbidden Words parameter is used for content moderation, ensuring that specific words are excluded from the generated output text. It is primarily useful where adherence to strict language guidelines or policies is required.
Once you provide the forbidden words in the VIDIZMO Indexer, the application ensures they are not included in the generated summary, even if they are contextually the most appropriate choice. The application prevents forbidden words from appearing and replaces them with the next most probable word or phrase, ensuring the meaning remains intact. This replacement is done without distorting the overall context or idea of the original text.
Forbidden words are applied to all types of summarization in VIDIZMO, ensuring that these words are excluded from the generated summaries, regardless of whether you use document, audio, video, or combined summarization.
Key Consideration
For this parameter, provide each word as an independent entry. Phrases or sentences are invalid input and will not effectively moderate the generated summary.
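The "next most probable word" behavior described above can be illustrated with a toy decoding step. This sketch is an assumption about how such a filter could work in general, not VIDIZMO's implementation; the candidate list, probabilities, and function name are all hypothetical.

```python
def pick_word(candidates: list[tuple[str, float]], forbidden: list[str]) -> str:
    """candidates: (word, probability) pairs sorted by probability,
    highest first. Returns the most probable word that is not forbidden."""
    banned = {w.lower() for w in forbidden}
    for word, _prob in candidates:
        if word.lower() not in banned:
            return word
    raise ValueError("every candidate word is forbidden")

# Hypothetical next-word candidates at one point in the summary:
candidates = [("terrible", 0.6), ("poor", 0.3), ("weak", 0.1)]
print(pick_word(candidates, ["terrible"]))  # -> "poor"
print(pick_word(candidates, []))            # -> "terrible"
```

In this picture, banning a word simply promotes the next-best candidate, which is why the surrounding meaning of the summary is preserved.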
Account Metrics
In VIDIZMO, you can track the consumption of various resources, including AI Processing, which measures the number of AI-based activities performed in your VIDIZMO account. Summarization and its classes are part of the AI Processing category.
Each type of summarization (such as Audio Summarization or Document Summarization) is tracked separately in your consumption reports. This means you can monitor how much you use each summarization type independently.
To use these summarization features, you need to obtain bundles that define the amount of processing you're allowed on your Portal. There are separate bundles for each of the summarization classes.
For more information regarding consumption reports, visit Consumption Reports for Deployment Overview.
Read Next
- How to Perform Summarization using VIDIZMO Indexer.
- Understanding Chaptering in VIDIZMO
- Speaker Diarization in VIDIZMO
- Understanding PII Detection and Redaction using VIDIZMO Indexer